On Separation of English Numerals from Multilingual Document Images
Abstract
For Optical Character Recognition (OCR) of bilingual or multilingual documents that contain text words in a regional language and numerals in English, the different scripts must be identified before a script-specific OCR is run on each. This paper attempts to separate English numerals at the word level from bilingual and trilingual documents in Kannada, Devnagari, Tamil, Odiya and Malayalam scripts, using discriminating features such as aspect ratio, stroke densities and eccentricity. The k-nearest neighbour algorithm is used to classify new word images, and the method is tested on 6000 sample words with five-fold cross-validation. The algorithm is robust to variations in font style, font size and noise. The results obtained are encouraging.
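The pipeline described in the abstract (per-word shape features, k-nearest-neighbour classification, five-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `word_features`, `knn_predict` and `five_fold_accuracy` are hypothetical, ink-pixel density stands in for the paper's stroke densities, eccentricity is computed from second-order central moments, and k = 3 is an arbitrary choice.

```python
import math
import random


def word_features(img):
    """Return (aspect_ratio, ink_density, eccentricity) for a binary word image.

    img is a list of rows of 0/1 values, where 1 marks an ink pixel.
    Ink density is an assumed stand-in for the paper's stroke densities.
    """
    pts = [(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    w = max(xs) - min(xs) + 1          # bounding-box width
    h = max(ys) - min(ys) + 1          # bounding-box height
    n = len(pts)
    mx, my = sum(xs) / n, sum(ys) / n  # centroid
    mu20 = sum((x - mx) ** 2 for x in xs) / n
    mu02 = sum((y - my) ** 2 for y in ys) / n
    mu11 = sum((x - mx) * (y - my) for x, y in pts) / n
    # Eigenvalues of the 2x2 covariance matrix give the squared axis lengths.
    root = math.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2)
    lam1 = (mu20 + mu02 + root) / 2
    lam2 = (mu20 + mu02 - root) / 2
    ecc = math.sqrt(max(0.0, 1 - lam2 / lam1)) if lam1 > 0 else 0.0
    return (w / h, n / (w * h), ecc)


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training vectors closest to x (Euclidean)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def five_fold_accuracy(X, y, k=3, folds=5, seed=0):
    """Shuffle once, hold out each fold in turn, return overall accuracy."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        tX = [X[i] for i in train]
        ty = [y[i] for i in train]
        correct += sum(knn_predict(tX, ty, X[i], k) == y[i] for i in test)
    return correct / len(X)


if __name__ == "__main__":
    # Synthetic, well-separated feature vectors (illustrative only, not paper data).
    X = [(1.0 + 0.01 * i, 0.3, 0.5) for i in range(10)]
    X += [(3.0 + 0.01 * i, 0.5, 0.9) for i in range(10)]
    y = ["numeral"] * 10 + ["script"] * 10
    print(five_fold_accuracy(X, y))
```

In practice the feature vectors would come from applying `word_features` to segmented word images, and features on different scales would normally be normalised before computing distances.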
Similar resources
Identification of Printed Punjabi Words and English Numerals Using Gabor Features
Script identification is one of the challenging steps in the development of optical character recognition systems for bilingual or multilingual documents. In this paper an attempt is made to identify English numerals at the word level in Punjabi documents by using Gabor features. The support vector machine (SVM) classifier with five-fold cross-validation is used to classify the word imag...
Reflections by Robert Phillipson on English in Post-Revolutionary Iran. From Indigenization to Internationalization, M. Borjian (2013) Multilingual Matters, ISBN 978-1847699091
Script Identification from Printed Document Images Using Statistical Features
Automatic identification of the script in a document image facilitates many important applications, such as automatic archiving of multilingual documents, searching online archives of document images, and selecting a script-specific OCR in a multilingual environment. In this work a technique for script identification from document images is proposed. The method uses vertical and horizontal...
Monothetic Separation of Telugu, Hindi and English Text Lines From a Multilingual
In a multi-script, multi-lingual environment, a document may contain text lines in more than one script or language. The different script regions of the document must be identified in order to feed each to the OCR of the corresponding language. In this context, this paper proposes a monothetic algorithmic model to identify and separate text lines of Telugu, Hindi and English ...
Removing Geometric Distortion of Text Using Geometric Information of Text Lines
Document images produced by scanners or digital cameras usually have photometric and geometric distortions. If either of these effects distorts the document, recognition of words from the document image using OCR is subject to errors. In this paper we propose a novel approach that significantly reduces geometric distortion in document images. In this method first we extract document lines from do...
Journal title:
- Journal of Multimedia
Volume 2, Issue
Pages -
Publication date: 2007